Linear regression for numeric symbolic variables: an ordinary 
least squares approach based on Wasserstein Distance 



Antonio Irpino, Rosanna Verde 

Dipartimento di Studi Europei e Mediterranei, Seconda Universit degli Studi di Napoli, Via del Setificio, 81100, 

Caserta, Italy 



Abstract 

In this paper we present a linear regression model for modal symbolic data. The observed vari- 
ables are histogram variables according to the definition given in Bock and Diday [1] and the 
parameters of the model are estimated using the classic Least Squares method. An appropriate 
metric is introduced in order to measure the error between the observed and the predicted dis- 
tributions. In particular, the Wasserstein distance is proposed. Some properties of such metric 
are exploited to predict the response variable as direct linear combination of other independent 
histogram variables. Measures of goodness of fit are discussed. An application on real data 
corroborates the proposed method. 
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1. Introduction 

In this paper we present a linear regression model for modal symbolic data in the framework 
of SymboUc Data Analysis (SDA). 

Since the first pioneering papers of Edwin Diday, SDA was become a new statistical field 
of research gathering contributions from different scientific communities: statistics, knowledge 
extraction, machine learning, data mining. Three European projects supported the systematic 
developments of the main methodological contributions and a wide collection of SDA methods 
is available in three reference books (Bock and Diday [1], Billard and Diday [2] and Diday and 
Noirhomme-Fraiture [3]). Moreover, many other papers have been published on Journals and 
Conference proceedings and currently SDA is almost always in the list of the topics of the most 
relevant international data analysis conferences. The basic ideas of SDA consists of analyzing 
data which are described by set-valued variables. Symbolic data refer to a description of a class, 
a category, a set of individuals and, more generally, they correspond to a description of a "con- 
cept". Examples of symbolic data are: football teams (classes of individuals); species of animals 
(categories); towns (concepts). Differently from the classic data where each "punctual" obser- 
vation assumes only one value for every variable, symbolic data take multiple values, such as 
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intervals of continuous variables, different categories of nominal variables, empirical distribu- 
tion or probability functions. 

Symbolic data are receiving more and more attention because they are able to summarize 
huge sets of data, nowadays available in large databases or generated in streaming by smart 
meters or sensors. 

A pecuharity of Symbohc data is that they allow to keep the variabihty in the data description, 
considering the interval of values that each observation can assume or the distribution of values. 
In this way, the methodological approaches developed in this context of analysis must guarantee 
this information is preserved. 

Symbolic Data Analysis methods generalize multivariate data analysis to that new kind of 
data and they can be classified at least in three categories, according to the input data - the 
method - the output data, as follows: 

symbolic-numerical-numerical: symbohc data in input are transformed into standard data in 

order to apply classic multivariate techniques. The results are classic data. For example, 
a dissimilarity between interval data is computed considering only the bounding values 
(minimum and maximum) of the intervals. That is a standard dissinnilarity between punc- 
tual data and the measure is a single value. 

symbolic-numerical-symbolic: symbolic data in input are analyzed according to a classic mul- 
tivariate techniques and the results are symbolic data. For example, intervals are trans- 
formed in mid-points and radii, a classic analysis (e.g. linear regression) is performed on 
these data but the results (e.g. predictive variable) are furnished in terms of intervals by 
reconstructing them from the estimated mid-points and radii. 

symbolic-symbolic-symbolic: symbolic data in input are transformed using generahzation/specialization 
operators and the results are symbolic data. For example, symbolic data represented by 
intervals, histograms or distributions are aggregated in homogeneous classes through a 
clustering methods using a criterion of homogeneity which takes into account the charac- 
teristics of the data (internal structure). The results are classes of symbolic data expressed 
by the same kind of variables of the symbolic input data. 

Most part of the SDA techniques are symbolic-numerical-symbolic, where the input and output 
data are symbolic and the methods are generalization of the classic data analysis methods to 
this new kind of data. Considerable contributions have been given in retrieving variability in- 
formation of the symbolic data through graphic representations of the results and tools for the 
interpretation. A large overview of the SDA methods is in Bock and Diday [1] and in Diday and 
Noirhomme-Fraiture [3]. The linear regression models proposed in this context of analysis have 
been introduced to study the structure of dependence of a response symbolic variable from a set 
of independent or explicative variables of the same nature. 

The first proposals were regression models for interval data as extension of the linear de- 
pendence model for numerical variables. Billard and Diday (2000) proposed a non-probabilistic 
approach based on the minimization of a criterion like the sum of squared error for the parame- 
ters estimation. The method consists in a regression model on the mid-points of the intervals and 
it makes reference to the basic statistics (mean, variance, correlation) introduced by Bertrand and 
Goupil [4] for interval data, de A. Lima Neto et al. [5], de A. Lima Neto and de A. T. de Carvalho 
[6] and de A. Lima Neto and de A. T. de Carvalho [7] improved the performance of the linear 
method adding the ranges of the intervals to the mid-points information in order to include the 

2 



variability expressed by the interval data. Therefore, the authors show the differences between 
this model and two regression models on the bounds of the intervals. Lima Neto et al. (2005) 
introduced an order constrained on the bounds of the predicted intervals in the model estimation 
process in order to guarantee the coherence between the observed and the predicted response 
variable. To overcome the problem of possible inversion of the bounds of the predicted intervals 
Lima Neto and De Carvalho (2008) suggested a non linear regression model on the mid-points 
and the ranges of the interval. Billard and Diday [2] presented at first a regression model for 
histogram variables. This approach is based on the basic statistics: mean, variance and correla- 
tion defined by Bertrand and Goupil [4] for intervals when they are assumed as random variables 
uniformly distributed. According to this approach the fitted regression model and their predicted 
values are generally single valued. The authors leave as an open problem output predicted values 
as symbolic data. 

Verde and Irpino [8] proposed a simple linear regression model which allows to estimate a 
histogram response variable as linear transformation of another independent histogram variable. 
The main idea is to propose a suitable metric to measure the sum of squared errors between the 
observed and predicted multi-valued data (histograms or distributions). The Wasserstein distance 
[9] { isa distance function defined between the probability distributions of two random variables 
X and Y, on a given metric space M. The minimal Li -metric {\ had been introduced and inves- 
tigated already in 1940 by Kantorovich for compact metric spaces Kantorovich [10]. In 1914 
Gini introduced the i metric in a discrete setting on the real line [11] and Salvemini 1943 (in the 
discrete case) [12] and Dall'Aglio 1956 (in the general case) [13] proved the basic representa- 
tion of Lp norm {p between the quantile functions of the two random variables. Mallows [14] 
introduced the ^2-metric in a statistical context. Moreover, starting from Mallows' work Bickel 
and Freedman [15] described topological properties and investigated applications to statistical 
problems as the bootstrap. They introduced the notion Mallows metric for {2. So the -metric 
was invented historically several times from difl'erent perspectives. Historically the notion of 
Gini - Dair Aglio - Kantorovich - Wasserstein - Mallows metric would be correct for this class 
of metrics. 

This measure that, we refer to henceforth as Wasserstein metric and that was already pro- 
posed by the authors in Clustering methods for interval [16] and histogram data [17, 18], seems 
particularly adapt in this context. According to this distance function, we study the dependence 
relationship of the histogram response variable from the explicative one considering the respec- 
tive quantile functions. 

Dias and Brito [19] referring to this last approach proposed a linear regression model for 
histogram data, directly interpreting the linear relationship between quantile functions. In the 
multiple regression model, as we will show in the present paper, one of the main problem is 
OLS cannot guarantee all the estimated parameters are positive. It is worth nothing that the 
predicted response variable is again a quantile function only if it is a linear combination of 
quantile functions with positive coefficients. In order to overcome such inconvenience Dias and 
Brito [19] proposed to introduce the so called symmetric quantile distributions in the model 
as new predictor variables. However the meaning of these new variables is not inmiediately 
interpre table. In the same paper a new measure of goodness-of-fit associated to the proposed 
model is also introduced. 

Difl'erently, our proposal is to exploit the properties of a decomposition of the Wasserstein 
proposed distance by Irpino and Romano [20], that is used to measure the sum of squared errors 
and rewrite the model splitting the contribution of the predictors in a part depending from the 
averages of the distributions and another depending from the centered quantile distributions. 
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The parameters associated to the predictors, constituted by the averages of the distributions, 
can be indifferently positive or negative because they effect only on the shift of the predicted 
quantile distribution. The authors already demonstrated for the simple regression model Verde 
and Irpino [8] that this leads to guarantee the positiveness of the parameter associated to the 
centered quantile function of the only predictor. However, in the multiple regression model, this 
solution is not automatically obtained, so that it needs to force the the positiveness of the multi 
parameters estimation by a Non Negative Least Squared algorithm. That is a classic algorithm 
that finds the solution among all the subsets of suboptimal OLS solutions. The rest of paper is 
organized as follows: in section 2 symbolic data are presented according to the definition given 
in Bock and Diday [1] and Diday and Noirhomme-Fraiture [3] books; in section 3 regression 
models for Numerical Probabilistic (Modal) Symbolic Variables are introduced and details on 
the new proposal are furnished; in section 4 some goodness of fit indices are proposed; in section 
5 applications on real data are presented in order to corroborate the procedure. 

2. Numerical symbolic data 

Symbolic data allow to describe concepts, individuals or classes of individuals, by means 
of multiple values for each descriptor (variable). The term symbolic variable was coined in 
order to introduce such new set-valued descriptions. In a classic data table (nx p individuals 
per variables) each individual is described by a vector of values, similarly, in a symbolic data 
table each individual is described by a vector of set-valued descriptions (Uke intervals of values, 
histograms, set of numbers or of categories, sometimes equipped with weights, probabilities, 
frequencies, an so on). According to the taxonomy of symbolic variables presented in Bock and 
Diday [1] and recalled by Noirhomme-Fraiture and Brito [21], we may consider as numerical 
symboUc variables aU those symbolic variables whose support is numeric. 

Given a set of n individuals (concepts, classes) Q. - {u\, . . ., aj„] a symbolic variable X with 
domain D is a map 

X:D.^D X(coi) € D. 

The different kinds of variable definitions depend on the nature of D. Considering only 
numerical domains, we can define the following symbolic variable: 

Classic Variable It is observed when D Q %, i.e. each individual o), is described by a single 
numeric value for the variable X; 

Interval Variable It is observed when D c /R, where /M is the set of all intervals of real 

numbers [a, b] where a,b e !Rand a < b; 

Modal Variables According to Bock and Diday [1] the domains of Modal Variables are sets 
of mappings. Considering different kinds of mapping, several kinds of Modal Symbolic 
Variables can be defined, let us consider D Q M where M, 6 M is a map M, : 5 ,■ — > W,-, 
such that for each element of the support e 5, it is associated w, = M;(i,) e %^ . 
If Miisi) has the same properties of a random variable (i.e. w{s)ds = 1, 
1 , ), X can be defined as a Numerical Probabilistic (Modal) Symbolic Variable (NPS V) and 
Mi(si) can described through a probabihty density function fiix). Particular cases of such 
data arise when the generic individual is described by a model of random variables, a 
histogram, an empirical frequency distribution. In this paper we refer only to such kinds of 
data that we call Numerical Probabilistic Symbolic Data (NPSD), that are in the domain 
of Numerical Probabilistic (Modal) Symbolic Variables. 
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For example, if w, for variable X is described by a normal distribion with parameters jU, and cr,-, 
we may describe it by its probability density function {pdf) fi(x) as follows: 

X{oJi) = fi{x) = [N(4ii,ad]. 

Using the same notation, and according to Bertrand and Goupil [4] and Billard and Diday 
[2], we may consider interval data as a particular case of NPSD, where the pdf is uniform. Given 
an interval description of w,as X(a>,) = [a,, fc,], we may rewrite the same description in terms of 
NPSD as: 

Histogram data are a particular case of NPSD, where, given the generic individual w, , a set 
of disjoint Ki intervals I^t = [ai^, bki\ k = l,...,Ki and a set of positive Kj weights such that 
Z£i Wfa = 1 , its description for the NPSV X is : 

^im) = fiix) = {{hi, Wu) (hi, Wki) (iKii, WKii)} . 

Also in this case it is possible to define a pdf for each histogram data as proposed by Irpino 
and Verde [17], considering a histogram as a mixture of uniform pdf's. 

In another way an interval can be treated as a histogram with k = I, such that lu = [a,, fe,] 
and wu = 1. 

Similarly, classic data (single valued numerical data), can be considered as NPSD, where the 
description is a Dirac delta function or as histogram data with one thin (a, = bi) interval. 

Notation and definitions. Let X\, . . . ,Xj, . . .Xp and Y be the independent and dependent NPSV's 
observed on a set of « individuals (concepts or classes). We denote with: 

• fiixj) and fiy) the empirical or theoretical probability density functions {pdf's), i.e. NPSD 
describing the i — th individual (for ; = !,...,«); 

• Fi(xj) and Ffy) the cumulative distribution functions (cdf's); 

• Xij(t) - Fj^(xj) andy,(0 = Fj^iy) the quantile functions (qf's); 

• Xij, yi and Sij, \ the means and the standard deviations of the Xij{t)'s and of yi{t) respec- 
tively. They are real numbers; 

• Xjif) - i Yli^i Xijit), yit) - i Y!i=i yiit) the means of the sets of n distributions Xijit) and 
yiit), i.e. the baricenter distributions; 

11 11 
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distribution means of the x,/f)'s and yi{t), or equivalently the means of the baricenter 
distributions of the and y,(f)- They are real numbers; 

• x'j-it) = Xijit) - Xij the centred quantile function , i.e. the quantile function shifted by 



Jxij(t)dt,yi = J yi(t)dt and Sij = Jj :,^j(t)dt - ^j, /. = Myf(t)dt-ff 



3. OLS linear regression for NPSD 

Given Xy, . . . ,Xp p explicative NPS V's and a dependent NPS V Y observed on a set Q, the aim 
is to fit the parameters of a linear regression function Denoted with X and Y respectively 
the matrix collecting the observed values of the Xj explicative NPSV's and the vector of the 
observed values of the predictive NPSV F, we write the regression model as foUows: 

Y = <*(X) + e (1) 

As for classic data, and according to the definitions of NPSD's, the model is fitted starting 
from the following symbolic data table, where instead of a matrix of scalar values, we have a 
matrix of NPSD's: 







Mxi) ■ 


■ Mxj) ■ 


■ MXp) ' 


[Y|X] = 


fi(y) 


Mxi) ■ 


■ fiiXj) ■ 


■ fiiXp) 




- My) 


Mxi) ■ 


■ fn(Xj) ■ 


fniXp) . 



Two main approaches for the estimation of the parameters of linear regression model have 
been proposed when the symboUc data are histograms. Starting from the elementary statistics 
proposed by Bertrand and Goupil [4], in a first approach Billard and Diday [2] introduced an 
extension of the classic OLS (Ordinary Least Squares) Unear regression model to the histogram- 
valued variables. A second group of approaches is based on the use of the quantile functions 
(which is biunivocal to a pdf) of the NPSD and of the Wasserstein distance for defining the 
sum of square errors in the OLS problem. The idea behind the latest approaches is to predict a 
quantile function after having observed a set of quantile functions as predictors. 



3.1. The Billard-Diday model 

According to Billard and Diday [2], the regression model which expresses a linear relation- 
ship between a set of predictors and a response histogram variable is based on the assumptions 
of Bertrand and Goupil [4], they consider a histogram as the representation of a cluster of indi- 
viduals. A second implicit assumption is that the histograms are the marginal distributions of a 
multivariate distribution with independent components: i.e., given the ; - th description, and the 
two fi(x) and fiiy) pdf's, and the joint pdf is expressed as fi(x,y) = fi(x) ■ fiiy). In this approach, 
interval data are considered as uniform pdf's, and as a particular case of histogram with just one 
interval with unitary weight. The regression method is based on the identification of the covari- 
ance matrix that depends on the following basic statistics; being Y and Xj the NPSV observed 
for n individuals, the means and the variances of each variable are computed according to: 

y= fy mdy = - V fy mdy = - V (3) 

4-co ^ +CO 
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and considering that 



1 " 1 " 

fix, y) = - y /icx, y) = - y m ■ m (5) 

n n ^ 



i=l !=1 

the covariance measure proposed by Bertrand and Goupil [4] is 



+00 +00 ^ 

x.y = J J x-y-fix,y)dxdy-x-y = ^Y^Xiyi-x- 



y. (6) 



It is important to note that in general + Sx,x- Billard and Diday [2] proposed a different 
way to compute Sx,y when data are intervals, considering them as uniform distributions as well 
as when data are histograms, considering them as weighted intervals. 

The proposal extends the hnear regression model for standard data, according to the following 
equation: 

p 

y^PQ + YjPjXj + e (7) 
where p's are estimated by means of the classic OLS estimators as follows: 



" Pi 




r 


^Xi,Xj 


^Xi,Xp 


-1 


h,xi 


Pj 




^Xj,Xl 


■ 4- ■ 

Xj 


^Xj,Xp 




■h:x, 


. Pp . 






^Xj,Xp 


■ 4 

Xp J 




■ ^y,Xp - 



p 

Po = y-YjPj^j 

The estimated model allows to predict % as follows 

p 

yi^Po + YjPj^iJ (8) 

Therefore, that is a method for predicting single values but not directly distributions (this 
point is also considered by the authors). In general, it is difficult to express the distribution 
of a Unear combination of random variables, in particular when the random variables are not 
identically distributed. In general, an approximation of the distribution associated to y can be 
obtained by means of a Montecarlo experiment. 

On the other hand, if a set of strong conditions hold (the knowledge of the cardinality of 
groups, the internal independence of the multivariate distribution in each group), the model pa- 
rameters have the same inferential properties of the classic OLS linear regression estimation 
method. 
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3.2. Wasserstein distance based models 

We have recalled that the Billard and Diday [2] model is implicitly founded on the model- 
ing of the union of groups or concepts and it is mainly based on the basic statistics proposed 
by Bertrand and Goupil [4]. For example if there are two groups G\ and G2 of people with the 
same cardinahty, described by their income distributions fiilNCOME) and fiilNCOME), all 
the basic statistics correspond to the classic statistics calculated for the mixture of distributions. 
From this point of view the basic statistics of the group G\ |J G2 are those calculated considering 
the pdf of the variable WCOM£ as: f {INCOME) = 0.5fi(INCOME) + 0.5fi{INCOME). This 
situation arises frequently when the aim is to describe unions of groups (for example, municipal- 
ities are grouped into cities). In other cases, this approach can be inconsistent. For example, we 
cannot know the cardinahty of the groups or it makes no sense to know the number of the elemen- 
tary observations: if we take several pulse rate measurements of two individuals a>i and C02 and 
we fit a distribution or a histogram f\ (PulseRate) and f2(PulseRate) for each one of them, we 
may be interested to discover relations between the two individuals by means of the comparison 
of their (probabilistic) respective distributions, instead of considering a mixture of distributions 
(that is also a logical non sense, two individuals cannot be fused into a super individual!). 
In this sense Verde and Irpino [18] proposed a different approach based on the comparison of 
distributions by means of suitable dissimilarity measures. Verde and Irpino [18] considered dif- 
ferent kinds of probabilistic metrics for histogram data and suggested that the same results can 
be extended to data described by density functions (i.e., NPSD). Among the discussed metrics, 
the €2 Wasserstein [9] distance permits to explain and interpret in an easy way the proximity 
relations between two probability functions. Given two pdf's /(x)and i>(.\), with means Xf and 
Xg, finite standard deviations Sf and Sg it is possible to associate respectively their cfd's F(x) and 
G{x). With each cdf's it is associated their quantile functions (qf), i.e. the inverse functions of 
the cdf: Xf{t) = F~^{t) and Xg{t) = G"'(0- The {2 Wasserstein distance is the following: 



dwif,8) 



j [Xfit) - Xgit)] 



dt. 



(9) 



The (2 Wasserstein distance is proposed for calculating the square errors in the OLS problem. 
Given the matrix (2), we consider the associated matrix M containing the corresponding quantile 
functions: 



(10) 



In this case, given a set of p quantile functions for the i - th individual, we look for a linear 
combination of Xij(tys (for j = 1, . . .,p) which allows to predict the >',(f)'s (for / = !,...,«) 
except for an error term e,(f). It is worth noting that e,(0 is a residual function, not necessarily a 
quantile function The model to be fit is the following: 





^1(0 


xn(t) ■■ 


■ Xij(t) ■■ 


■ Xip(t) 


M = [ Y 1 X ] = 


yi(t) 


xa(t) ■ 


■ Xij(t) ■ ■ 


Xipjif) 




. yn(t) 


Xnlit) ■■ 


■ X„j{t) ■ ■ 


Xnpif) . 



ym =/3o + Y^pjXijit) + em, 
7=1 



(11) 
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where the Sum of Squared Errors (S S E) to minimize for the solution of the OLS problem is 
related to the Squared £2 Wasserstein distance as follows: 



SSE = J] [eiit)f = yiit)' A + YjPjXijit) 



1=1 



1=1 



(12) 



A problem arises for the linear combination of quantile functions: only if /?j > (j = 1, . . . , /?) it 
is assured thaty,(0 is a quantile function (i.e. a not decreasing function). In order, to overcome 
this problem, Dias and Brito [19] proposed a novel method for the regression of histogram valued 
data based on the Wasserstein distance between quantile functions. In order to take into account 
also inverse casual relations, Dias and Brito [19] proposed to expand the matrix M adding also 
the quantile functions of the symmetric distributions of the explicative symboUc variables. Given 
fi(xj) (with the respective quantile function Xij(t)), the corresponding symmetric distribution 
fi(xj) (and its quantile function x,j(f)) is obtained by multiplying the support of fi(xj) by - 1 , such 

1 

that the integral of the sum of the two quantile functions is equal to zero (J [x,;(f) + x,y(f)] dt = 0). 



The model is the following: 



yiit) =/3o + Y^PjXijit) + YjPjXiAt) + etit), 



(13) 



and the estimation of the parameters is obtained optimizing the following constrained OLS 
problem: 



argminSS = /i^ 



I 



yiit), 



s.a 



i=\ 

> 



/3o + j]/ijXij(t) + j]^jxij(t) 
7=1 7=1 



3.3. The Irpino and Verde model for multiple regression 

The negative value of the parameter Pj in model (11) is in general not acceptable when 
dealing with quantile functions. In order to overcome this inconvenient, starting from a partic- 
ular decomposition of the Wasserstein Verde and Irpino [8] presented a new formulation of the 
problem in the simple regression model, that is when only one variable X affects the predicted 
variable Y. An extension of this approach to the multivariate case has been already presented by 
the authors [22] and it takes into consideration the matrices in equation (19). 

The introduction of the new model can be done according to some preliminary considera- 
tions about the properties of (2 Wasserstein distance decomposition. Given / and g two NPSD 
and x^(f) and x!g{t) the respective centred quantiles functions (above defined in §2 Notation and 
definitions) , Cuesta-Albertos et al. [23] showed that the (2 Wasserstein distance can be rewritten 
as 



S) = [xf - Xgf + J [x}it) - xlit)f 



dt. 



(14) 



This property allows to consider the squared distance as the sum of two components, the first 
related to the location of NPSD and the second related to their variability structure. Irpino and 

9 



Romano [20] improved such decomposition, showing that the can be finally decomposed into 
three quantities: 

dwif' S) = [xf - Xgf + [sf - Sgf + IsfSg (l - p{xf, Xg)) (15) 
where p{xf, Xg) is a correlation coefiicient about the quantile functions, i.e.: 

L Xf{f} ■ Xg{f)dt — Xf ■ Xg 
p{xf,Xg) = . (16) 

Irpino et al. [24] showed some computational aspects relating to p when data are histograms 
and they showed that p is computed in a linear time with the total number of bins of the his- 
tograms. Moreover, the Equation 16 allows to define the inner product between two ^'s, as 
follows: ^ 

{xf{t), Xg(f)) = Xf{t) ■ Xg{t)dt = pixf, Xg) ■ Sf Sg + Xf ■ Xg. (17) 

Thus, given two vectors of quantile functions x = [xKOJwxiand y - [yKOJnxi, we can define 
the scalar product of two vectors of NPSD as: 

n 

x'^y = ^ [pixi, yi) ■ Sjci • Sy. + Xi ■ . (18) 
1=1 

If we consider that X;^(0 - Xij{t)-Xij, each element of X can be rewritten as x,//) = x'^^p)+Xij. 
The same is vahd for vector Y. Matrix M is transformed as: 

M = [ Y + ¥*= I X + X-^ ] = [ Y I X ] + [ ¥*= I X*^ ] (19) 

where Y - Ly/J^xi is the vector of the means of the fiiy) , Y"^ = [^/^(Oj^^j is the vector of 
the centred quantile functions of /i(y)'s, X = [.^o]„xp matrix of the means of the fiixj), 

~ [•^/^-'inx;; matrix of the centred quantile functions of fiixjYs. 

We assume that each quantile function y,(0 can be expressed as a linear combination of the 
means x, j and of the centred quantile functions x?.(0 plus an error term e,(?) (which is a function) 
as follows: 

p p 

If we consider the matrix X+ = [1|X], we can rewrite the model in (20) using the matrix 
notation as follows: 

Y = X+B + Xr + e (21) 

In order to estimate the parameters, we define the Sum of Square Errors function (SSE) like 
in the OLS method, using the Wasserstein (2 measure: 



1 „ n r , 

(22) 



1=1 Q 1 = 1 Q [ 7=1 7=1 
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in matrix form: 

SSE{B,r) = e^e = [y - X+B - XT]^ [y - X+B - XT] (23) 

Considering the equation (21) induced from the £2 Wasserstein metric, we have X+X*^ = 
0(;,+i)xp, X^: Y = X^ Y and X'^^ Y = X'^ Y'^l Then, SSE in equation (23) can be decomposed into 
two positive quantities as follows: 

55£(B,r) =55£(B) + 55£(r) = e^e + (e'^)^e'^ (24) 

where: 

e = Y-X+B (25) 

= Y'-XT (26) 

with e = [Cilnxi a vector of real numbers. 

We may express the single minimization problem as the minimization of two independent 
functions: the first one related to the means of the predictor quantile functions x,/s in X+, and 
the second one related to the variabiUty of the centered quantile distributions x^j(0's in X'. Then 
two models are independently estimated: 



Y = X+B + e (27) 
Y'^ = XT + 6*^ (28) 

The first can be solved as classic OLS problem for the estimation of B: 

argmin 5 5 (B) = [y - X+B - el^ [y - X+B - el (29) 

B 

The second (30) is solved using the NNLS (Non Negative Least Squares) algorithm proposed 
by Lawson and Hanson [25], modified with the introduction of the product between quantile 
functions (Eq. 17) in the classic matrix computations: 



argmin S Sir) = [Y' - XT - 6*=]^ [Y' - XT - e'] (30) 
r 

s.a. yj>Oj=l,...,p. 
The estimated OLS parameters are: 

6 = (X^X)^^X^Y (31) 

f = (X'=^xy ' X'^Y^ (32) 

Therefore, a qf ytit) is predicted by the estimated linear model according to the estimated 
parameters: 

p p 

Ut) = Vi + y-it) =00 + Y^^jXij + yj^ui^)- (33) 

7=1 7=1 



^Some algebraic details are in appendixAppendix A 
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4. The goodness of fit evaluation 

Considering the nature of the data, the evaluation of the goodness of fit of the model is not 
straightforward like for the classic linear regression models. Here we present three indices that 
can be used for evaluating the goodness of fit of a regression model on NPSD. The first is the 
Q measure proposed by Dias and Brito [19], the second is the Pseudo - proposed by Verde 
and Irpino [8] and the last is the classic square root of the mean sum of squares expressed using 
the (2 Wasserstein distance (we denote is as RMSEw). The three measures are detailed in the 
following. 

ilDias and Brito [19] The proposed measure is the ratio 



a = 



(34) 



that varies from zero to 1 . 

Pseudo - It is a measure proposed by Verde and Irpino [8] for the simple linear regression 
model. Verde and Irpino [26] proved that the Wasserstein distance can be used for the 
definition of a sum of squared deviation as follows: 

1 

n " r 

SSY = n-sl = J]dl (y,(0, y(0) = J] I b"<^) " (^5) 
(=1 (=1 "5 

where y{t) is the baricenter of NPSD jXO's as defined at the end of section 2. 

A common tool for the evaluation the goodness of fit of the model is the well known 

coefficient of determination (R^ = ||| or = 1 - ||y 

Computation of this coefficient is based on the partitions of the total variation in the depen- 
dent variable, denoted SST, into two parts: the part explained by the estimated regression 
equation, denoted SSR, and the part that measures the unexplained variation, SSE, referred 
to as the residual sum of squares: 



SSY = SSE + SSR 



(36) 



In our case, in general, the equality does not hold, and we prove^ that the decomposition 
of 5 57 is the following: 



1 1 

n ^ n p 

SSY = Y, J \yiit)-yiit)fdt+Y, J \yit)-yiit)fdt- 




(37) 



Bias 



^See appendix Appendix B. 
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The bias term rVSS(r) in eq. (37) reflects the impossibihty of the Hnear transformation 
of x(f) of reflecting the variabihty structure of yit). In general, this term goes to zero when 
NPSD have the same shape (i.e., from the third ones forward, the standardized moments 
of the histograms are equal) and the standard deviations of /iy(x)'s are proportional to the 
standard deviations of the fiiyYs. Further, the term rV55(r) indicates the impact of the 
suboptimal solution obtained using the NNLS and is equal to zero only when the y/s are 
the same of those calculated without the non negativeness constraint. 
In this case, the classic - I - ||y statistic can be less than zero or greater than 1 . In 
order to obtain a measure of goodness of fit that does not suffer of the described drawback, 
we propose to adopt the following general index, that takes values in the real interval 

0,1 



SSE 



SSY 



PseudoR^ = min max 0; 1 - 
Denoting with Bias = rVSS(r), this index is also equal to: 

2 bias 

PreudoR 



; 1 



SSY 



(38) 



(39) 



RMSE The Root Mean Square Error is generally used as measure of goodness of fit. Choosing 
an appropriate measure for computing the distance between NPSD's, we may compute 
the RMSE. In this paper, having used the Wassertein distance, we propose the following 
measure for the RMSE: 



RMSEw = \ 



j:J(yiit)-yiit)fdt 

'=io jSSE 

n V n 



(40) 



5. Application on real data 

To illustrate the proposed method we choose some examples presented in the literature on 
chnic data and a new climatic dataset. Expecially, we make use only of dataset of NPSD de- 
scribed by histograms usually arisen as summaries of large amount of data. 

The first dataset is presented by Billard and Diday [2, Chap. 6, Table 6.8] and also presented 
as application in Dias and Brito [19]. In the dataset presented in table 1 there are the Hematocrit 
(Y) histogram NPSD and the Hemoglobin (X) histogram NPSD observed for 10 units. 

Fig. 1 shows the graphical representation of table 1 and the graphic representation of the 
means (barycenter) NPSD of each histogram variable, according to the barycenter of histogram 
variable as presented in [26]. 



Table 2 specifies the main summary statistics for the two histogram variables using the Bil- 
lard and Diday [2] set of summary statistics and those proposed by Verde and Irpino [26]. For 
the barycenters the mean and the standard deviation are only reported. Obviously, it is possible 
to report also the other moments of the barycenters, but for the sake of brevity we prefer to leave 
to the reader further considerations looking directly to their graphical representations in fig. 1. 
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Y:Hematocrit X: Hemoglobin 




30 35 40 45 50 10 12 14 16 



Figure 1: Blood dataset. Histogram representation, in the last row are represented the barycenter histograms. 
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Units Y: Hematocrit 



X: Hemoglobin 



1 {[33.29;37.57[, 0.6 ; [37,52; 39.61], 0.4} 

2 {[36.69; 39.11[ , 0.3; [39.11; 45.12] , 0.7) 

3 ([36.69; 42.64[ , 0.5; [42.64; 48.68] , 0.5) 

4 {[36.38; 40.87[ , 0.4; [40.87; 47.41] , 0.6) 

5 {[39.19; 50.86] , 1) 

6 {[39.7; 44.32[ , 0.4; [44.32; 47.24] , 0.6) 

7 {[41.56; 46.65[ , 0.6; [46.65; 48.81] , 0.4) 

8 {[38.4; 42.93[ , 0.7; [42.93; 45.22] , 0.3) 

9 {[28.83; 35.55[ , 0.5; [35.55; 41.98] , 0.5) 

10 {[44.48; 52.53] , 1) 



{[11.54; 12.19[ , 0.4; [12.19; 12.8] , 0.6) 
{[12.07; 13.32[ , 0.5; [13.32; 14.17] , 0.5) 
{[12.38; 14.2[ , 0.3; [14.2; 16.16] , 0.7) 
{[12.38; 14.26[ , 0.5; [14.26; 15.29] , 0.5) 
{[13.58; 14.28[ , 0.3; [14.28; 16.24] , 0.7} 
{[13.81; 14.5[ , 0.4; [14.5; 15.2] , 0.6) 
{[14.34; 14.81[ , 0.5; [14.81; 15.55] , 0.5} 
{[13.27; 14.0[ , 0.6; [14.0; 14.6] , 0.4) 
{[9.92; 11.98[ , 0.4; [11.98; 13.8] , 0.6) 
{[15.37; 15.78[ , 0.3; [15.78; 16.75] , 0.7} 



Table 1: Blood dataset; 10 units described by two histogram- valued variables. 



Y:Hemoglobin X: Hematocrit 



n 




10 




IVIean (BD) 


42.26 




14.05 


Barycenter mean(VI) 


42.26 




14.05 


Barycenter std (VI) 


2.660 




0.622 


Standard deviation (BD) 


4.658 




1.355 


Standard deviation (VI) 


3.824 




1.204 


Correlation (BD) 




0.903 




Correlation (VI) 




0.979 





Table 2: Blood dataset: summary statistics. BD refers to the approach of [2], while VI refers to the approach of [26] 



In Tabs. 3, 4 and 5, the rows labeled with Observed reports the OLS estimates of the param- 
eters of the regression models using the formulation of Billard and Diday [2] as described in 3 . 1 , 
the formulation of Dias and Brito [19] as presented in 3.2 and the novel formulation as proposed 
in 3.3. 

In order to compare the three models, we computed the goodness of fit measures presented in 
4. We remark that Billard-Diday model does not allow directly to predict a distribution function 
or a quantile function associated with fiiy), thus, for computing the goodness of fit indices of the 
Billard-Diday model, we performed a IVIontecarlo experiment in order to estimate the predicted 
fi(y) distribution. 

In order to calculate the confidence interval of the parameters of the three models, and consid- 
ering the complexity of establishing manageable probabihstic hypotheses on the error functions 
e,(?) like in the classic Unear regression estimation problem, we have performed the bootstrap 
[27] estimates of the parameters of the models by constructing 1 , 000 resamples of the observed 
dataset (and of equal size to the observed dataset), each of which is obtained by random samphng 
with replacement from the original dataset. The confidence interval of each parameter is calcu- 
lated using the percentile method, i.e. by using the 2.5 and the 97.5 percentiles of the bootstrap 
distribution as the hmits of the 95% confidence interval for each parameter. 

As usual, the point estimates is the mean bootstrap value. The main results about the param- 
eters and the goodness of fit indices are shown in Tables 3,4 and 5. 

The results show that the Dias-Brito and the novel proposed method fit better than the Billard- 
Diday model, and that the Dias-Brito and the proposed method have negUgible difl'erences con- 
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Billard-Diday Model: yi -00+ 0Xi 





Model parameters 




Goodness of fit 








Pseudo/?^ 


RMSEw 


Observed 


-1.336 


3.103 


0.811 


0.919 


3.431 






Bootstrap estimates 




Mean 


2.224 


2.851 


0.692 


0.812 


3.860 


Bias 


3.561 


-0.253 








SE 


6.592 


0.461 








2.5% 


-4.528 


1.609 


0.197 


0.155 


2.709 


97.5% 


20.243 


3.318 


0.888 


0.953 


6.854 



Table 3: Blood dataset: BIllard-Diday model parameters estimated on the fuU dataset and bootstrapping the dataset. 



Dias-Brito Model: yi{t) = /3o + 0i ■ Xi{t) + ySi • x,(f) 



Model parameters 



Goodness of fit 





00 


01 


01 ^ 


PseudoT?^ 


RMSEw 


Observed 


-1.953 


3.560 


0.413 0.963 


0.945 


3.431 








Bootstrap estimates 




Mean 


-1.657 


3.574 


0.448 0.963 


0.935 


2.623 


Bias 


0.296 


0.014 


0.035 






SE 


2.862 


0.165 


0.164 






2.5% 


-5.848 


3.255 


0.217 0.928 


0.823 


1.671 


97.5% 


5.037 


3.931 


0.848 0.986 


0.981 


3.345 



Table 4: Blood dataset: Dias-Brito model parameters estimated on the full dataset and bootstrapping the dataset. 



Irpino-Verde model:yi(t) = fio + 0i ■ Xi + ji ■ xj'(f) 



Model parameters Goodness of fit 









71 n 


Pseudo/?^ 


RMSEw 


Observed 


-2.157 


3.161 


3.918 0.961 


0.943 


2.892 








Bootstrap estimates 




Mean 


-1.928 


3.146 


3.969 0.961 


0.931 


2.688 


Bias 


0.229 


-0.016 


0.051 






SE 


2.833 


0.199 


0.269 






2.5% 


-6.348 


2.688 


3.602 0.924 


0.812 


1.730 


97.5% 


4.644 


3.462 


4.710 0.985 


0.980 


3.403 



Table 5: Blood dataset: Irpino-Verde model parameters estimated on the full dataset and bootstrapping the dataset. 
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f EPA ♦ NPS 

Figure 2: Ozone dataset. EPA and NPS monitored sites. 

sidering the goodness of fit. The difference is about the interpretation of the regression parame- 
ters. The parameters of the Billard-Diday model are interpretable as for classic regression but the 
dependent predicted distributions cannot be easily described. The Dias-Brito model parameters 
show that a unitary (positive) variation of all the quantiles of the Hematocrit induces a variation 
of 3.574 of the Hemoglobin quantiles, while a unitary increase of the "symmetric" Hematocrit 
quantiles induces an increase of the Hemoglobin quantiles equal to 0.448, this can be a bit con- 
fusing considering that there is a positive effect both on the original dependent variable and on 
its symmetric version. 

The interpretation of the Irpino- Verde model is different as it takes into consideration the two 
components of the NPSD: the variability of the means and the variability of the centered NPSD 
(the variability of the distributions variabihty). The estimated model enounces that a unitary vari- 
ation of the mean of the Hematocrit induces a variation of 3 . 146 in the mean of the Hemoglobin, 
while an increasing of one in the variability of the Hematocrit produce an increase of 3.969, in 
average, of the variability of the Hematocrit. 

The second dataset derives from the Clean Air Status and Trends Network (CASTNET)"*, an 
air quality monitoring network of United States designed to provide data to assess trends in air 
quality, atmospheric deposition, and ecological effects due to changes in air pollutant emissions. 
In particular, we have chosen to select data on the Ozone concentration in 78 USA sites among 
those depicted in Fig. 2 for which the monitored data was complete (i.e. without missing values 
for each of the selected characteristics). 

Ozone is a gas that can cause respiratory diseases. In the literature there exists studies that 
relate the Ozone concentration level to the Temperature, the Wind speed and the Solar radiation 
(see for example [28]). 

Given the distribution of Temperature (Xi) (Celsius degrees), the distribution of Solar Radi- 
ation (X2) (Watts per square meter) and the distribution of Wind Speed (X3) (meters per second). 



^http://java.epa.gov/castnet/ 
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SITEID 
ABT147 

ACA416 

ALC188 

ALH157 

ANA115 

ARE128 

ASH135 

BBE401 

BEL116 

BFT142 

BVL130 

BWR139 

GAD160 

CAN407 

CDR119 

CDZ171 

OHA467 

CHE185 

CKT136 

CJSrD125 

CNT169 

CON186 

COW137 

CTHllO 

CVL151 

DCP114 

Baiyc. 




Figure 3: Ozone dataset: monitored sites from 1 to 26. 



the main objective is to predict the distribution of Ozone Concentration (Y) (Particles per bilUon) 
using a linear model. CASTNET collect hourly data and as period of observation we choose the 
summer seasons of 2010 and the central hours of the days (10 a.m. - 5 p.m.). 
For each sites we have collected the NPSD of the four variables in terms of histograms and in figs. 
3, 4 and 5 we present the monitored sites using their histograms^, in order to have a reference we 
report in the last rows the barycenters that are better shown in fig. 6. 

In table 6 we reported the main summary statistics for the four histogram variables, while in 
figure 6 are drawn the four barycenters where it is possible to observe the average distributions 
(in the sense of Verde and Irpino [26]) of the 78 sites for each variable. 

We can note, for example, the different skewness of the barycenters, in general when the 
barycenter is skew, we may observe that the NPSD are in general skew in the same direction. 
This is not in general true for symmetric barycenters, that can be generated both from left and 



'We can supply the full table of histogram data, the Matlab routines and the workspaces upon request. 
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SITEID 
DEN417 

ESP127 

GAS153 

GLR468 

GRB411 

GRS420 

GTHiei 

HOW132 

HO XI 48 

HWF187 

IBL141 

JOT403 

KEF112 

KNZ184 

LAV410 

LRL117 

LYK123 

MAC426 

MCK131 

MCK231 

MEV405 

MKG113 

MOR409 

OXF122 

PAR107 

PED108 

Baiyc. 




Figure 4: Ozone dataset: monitored sites from 27 to 52. 





Ozone Concentration Temperature Solar Radiation 


Wind Speed 




(FinPpb) 


(Xi in Celsius deg.) (X2 Watt/m^) 


(X3 m/s) 


Mean (BD) 


41.2147 


23.2805 


645.3507 


2.3488 


Barycentcr mean (VI) 


41.2147 


23.2805 


645.3507 


2.3488 


Barycenter std (VI) 


9.9680 


3.7641 


225.7818 


1.0987 


Standard dev. (BD) 


13.790 


5.3787 


252.6736 


1.7125 


Standard dev. (VI) 


9.5295 


3.8422 


113.4308 


1.1337 






Correlations 








Billard-Diday 


Verde-Irpino 








^3 Xi 


X2 X^ 




Y 


0.2328 0.4064 


0.2951 0.2473 


0.6392 0.4020 




Xi 


0.2622 


0.0621 


0.4537 0.1429 




X2 




0.3013 


0.4394 





Table 6: Ozone dataset: summary statistics 
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SITEID 
PET427 

PIN414 

PND165 
PNF126 
PRK134 
PSU106 
QAK172 
ROM406 
SAL133 
SAN1S9 
SEK:430 
SHN418 
SND152 
SPDlll 
STK138 
SUM156 
THR422 
UVL124 
VIN140 
VOY413 
VPI120 
WNC429 
WSP144 
WST109 ^ 
YEL408 ^ 
YOS404 
Bajyc. 




Figure 5: Ozone dataset: monitored sites from 53 to 78. 
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Y: Ozone concentration (ppb) Xi: Temperature (Cel. degrees) 




20 40 60 15 20 25 30 



X2: Solar Radiation (IFa^^/m^) X:^: Wind Speed (m/sec) 




200 400 600 800 2 4 6 



Figure 6: Ozone dataset. Barycenters 

right skewed distributions. 

Using the full dataset we estimated the three models and the associated goodness of fit diag- 
nostics as reported in tables 7, 8 and 9. 

Also in this case, we performed a bootstrap estimates of the pentameters and of the goodness 
of fit measures of the three models in tables 7, 8 and 9. 

Observing the goodness of fit measures of the three models we can conclude that the Irpino- 
Verde and the Dias-Brito model fit better the linear regression relationship than the Billard-Diday 
model, and the Irpino- Verde model is sUghtly more accurate than the Dias-Brito one. Also in this 
case, the Irpino- Verde model parameters give an easier interpretation. Reading the Irpino- Verde 
bootstrapped model, we may assert that the Ozone concentration distribution of a site depends 
from the mean solar radiation where for each AWatt/m^ a Q.QlQ{ppb) variation of the Ozone 
concentration mean level it is expected, while in general we cannot say that the mean level of 
temperature and of wind speed induces a significant variation of the ozone concentration level 
(the 95% bootstrap confidence intervals include zero). Furthermore, we may say that the vari- 
abiUty of the ozone concentration is quite the same as the temperature (0.928), a unit variation 
in the variabiUty of the solar radiation induces a variation of O.Oliippb) and a variation of the 
variability of the Wind Speed causes an increase in the variability of 1 .95%{ppb). Similar conclu- 
sions can be derived reading the Dias-Brito model even if it gives a different interpretation of the 
parameter associated with the symmetric histogram variables. 
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Billard-Diday Model: 







.V/ = 


A) +/3|-v,i 


+ ^2-V, 


2 +AvV',3 










Model parameters 






Goodness of tit 








P2 




D. 


Pseudo - 


RMSEw 


Observed 


18.28 


0.357 


0.017 


1.550 


0.203 


0.024 


83.13 








Bootstrap 


estimates 






Mean 


18.96 


0.323 


0.018 


1.463 


0.215 


0.103 


81.91 


Bias 


0.678 


-0.034 


0.001 


-0.087 








SE 


5.077 


0.182 


0.005 


0.652 








2.5% 


9.52 


-0.014 


0.007 


0.147 


0.070 


0.000 


68.59 


97.5% 


28.89 


0.697 


0.027 


2.836 


U.35U 


0.372 


96.85 



Table 7: Ozone dataset: BIUard-Diday model parameters estimated on the fUU dataset and bootstrapping the dataset. 



Dias-Brito model: 
yi(t) =00 +0ixn(t) +02Xa(t) +hxBit) +0ixn(t) +02Xa{t) +%XB{t) 



Model parameters Goodness of fit 





A) 




h 


h 




h 


h 


Q 


Pseudo - 


RMSEw 


Observed 


13.32 


0.000 


0.037 


1.691 


0.000 


0.000 


0.000 


0.670 


0.371 


66.74 












Bootstrap estimates 








Mean 


14.22 


0.117 


0.034 


1.709 


0.080 


0.000 


0.002 


0.712 


0.358 


65.07 


Bias 


0.905 


0.117 


-0.003 


0.018 


0.080 


0.000 


0.002 








SE 


4.760 


0.161 


0.004 


0.610 


0.112 


0.000 


0.025 








2.5% 


5.409 


0.000 


0.026 


0.602 


0.000 


0.000 


0.000 


0.625 


0.220 


49.58 


97.5% 


24.46 


0.540 


0.040 


3.070 


0.391 


0.000 


0.000 


0.801 


0.498 


80.60 



Table 8: Ozone dataset: Dias-Brito model parameters estimated on the fuU dataset and bootstrapping the dataset. 



Irpino- Verde model: 

Ut) =00 +hxii +hxi2 +hxB + yi4i(0 + 724(0 + 734^(0 

Model parameters Goodness of fit 









P2 


j83 


fi 


Ji 


73 


Q 


Pseudo - R^ 


RMSEw 


Observed 


2.928 


-0.346 


0.070 


0.395 


0.915 


0.018 


1.887 


0.742 


0.460 


61.82 












Bootstrap estimates 








Mean 


3.108 


-0.353 


0.070 


0.363 


0.928 


0.018 


1.958 


0.758 


0.474 


59.43 


Bias 


0.181 


-0.008 


0.000 


-0.032 


0.013 


0.000 


0.071 








SE 


7.180 


0.271 


0.010 


0.823 


0.237 


0.003 


0.542 








2.5% 


-11.24 


-0.846 


0.052 


-1.186 


0.482 


0.012 


1.054 


0.675 


0.296 


46.52 


97.5% 


18.87 


0.173 


0.090 


2.030 


1.377 


0.024 


3.115 


0.829 


0.625 


73.29 



Table 9: Ozone dataset: Irpino- Verde model parameters estimated on the full dataset and bootstrapping the dataset. 
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6. Conclusions 

The paper present a novel linear regression technique for data described by probability-like 
distributions, using their quantile functions and the ordinary least squares method based on the 
Wasserstein distance. Considering the nature of the data we proposed to use a particular de- 
composition of the Wasserstein distance for the definition of the regression model. We have 
also furnished an alternative goodness of fit index which takes into account the differences in 
shape and size of the quantile distributions of the independent variables. The proposed model 
corroborated on the above examples seems to have a better performances, in terms of goodness 
of fit, with respect the two main approaches presented in the hterature. Further it allows an eas- 
ier interpretation of the results. We also showed that the method can be used with a variety of 
numerical probabilistic symbolic data. Considering the complexity of the error term, the classic 
parameter inferential properties can not straightforward be extended to the regression of NPSD. 
We consider to address new efforts in the direction of investigate the properties of the involved 
estimators. 



Appendix A. OLS solution details 

Here we illustrate the OLS main passages 



55(B,r) = e^e = 

= [y - X+B - XT]^ [y - X+B - XT] = 

= Y^Y - Y^'X+B - Y^X<=r - B^X^Y + B^X^ + XB+ 

-i-B^x^xT - r^x'^^Y + r^x'^XB + r^x'^x'^r = 

= Y^Y - 2B^X^:Y + B^X^XB + 2B^X^:X'r-H 

-2r^x'^^Y -I- r^x'^xT 



first order conditions 



sss 



^ = -2X^Y -I- 2XiX+B -I- 2X^.X'^r = 
^ = -i-2B''X^X' - 2X'Ty + 2X'Txcr = o 



= 0^-2 X^Y -h2X^X+B + 2 X^X' r = ^ B = (x^X+) X^Y 

XlY 

= ^ 2B^X^X'^ -2X^-H2X'^X'^r = ^ T = (x'^^X')"' X'^Y*^ 



6B 

6SS 



(A.l) 



Appendix B. Sum of square error decomposition 

Considering 1 = [l]„xi. SSY can be written as: 



SSY = n-sj = Xdl(yiit), m) = Z / biit) - mf dt = 

i=l !=1 



Y + Y^-l^/(0) 

Y 
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further 



SSY ^ /(y - lyit) + Y - y)^ (y - lyit) + Y-Y)dt-- 





= / 





Y-Y-(lKO-Y(0) 



Y(0-Y(0 -(ly(0-Y(f)) 

V s(t) 



dt = 



sitfsit) + {im - Y(t)f {lyit) - Y(r)) -2 [lyit) - Y(t)f e(t) 



SSE 



SSR 



dt 



SSE + SSR-lj (lyit) - Yit)f s{t)dt ■■ 







SSE+SSR-2 



1 1 

J mtyf s{t)dt- J {Y(t)f 





s(t)dt 



(/) 



(//) 



for (I) we have 



r ' 

Hnximf e{t)dt = / {ly{t)f (Y(0 - X^B - X'(f)T) dt ■■ 



= l^Yit)dt- J(yit)f l^X^Bdt- J (yit)f l^^%t)Tdt = 



1 1 p 

= / (yit)f n ■ y{i)dt -nf-f (yit)f nX'^itWdt = n ■ + nf - nf - n^. Jpry-xjCTyCT-^j 





= n-(rj-nZ ypryxjO-y(Txj = « • cr? - ^ ypryj^.a-ycr^. 



for (II) we have 
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■ J [^(t)f s(t)dt = - / (x+B + x%t)rf (y(o - x+B - x^(or) dt ■■ 



ill) 



= - / (x+b)^ \{t)dt + j (x+b)^ X+Bdt + j (x+b)^ X%t)Tdt+ 



1 1 1 

- / (x^(or)^ (Y(0) dt + J {x%t)TY (x/3) dt + j iX'it)TY {X'(t)r) dt ■ 



(X'(t)r)dt+ 



W08)=O 



r 

-JiX%t)Tf(Yit))dt+ (X'it)Tf{x/3)dt + j(X'{t)rf(X'^it)T)dt 
J 



1 I 

= - / (X'^(t)rf(Y(t))dt + J(x%t)rf(X'^(t)r)dt = 



1 1 
= - / (X'^(Or)^ (y + Y'it)) dt + J iX'it)Tf (X'it)T) dt = 



r 

= - (x'(t)r)^Ydt-J(X'(t)r)'^Y%t)dt + J(X'it)Tf(x%t)r)dt = 

Jo 



1 1 

/(x'„,r,'(x'(,)r,.,-/(x'(,)r,'v'(,M, = rv/(r) 





Oif using OLS 
1 



/ (X'^(/)r)^ {x%t)r)dt - / (X"(/)r)^ Y'={t)dt ■■ 



1 

= rj [iX%t)fX%t)T-{X%t)fY'(t)\dt 



v/(r) 



bias = -2 





p 




n ■ 




+ rv/(r) 




y 7=1 ) 
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the decomposition is then: 



SSY ^SSE + SSR-2 



n • 



0"i 



1' 

-z 



rjryxjO-yO-^. 



7=1 



+ rv/(r) 
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